Abstract

In this paper a recurrent Newton algorithm for an important class of recurrent neural
networks is introduced. It is noted that a suitable constraint must be imposed on recurrent
variables to ensure proper convergence behavior. The simulation results show that the
proposed Newton algorithm with the suggested constraint performs uniformly better than
the back-propagation algorithm and the Newton algorithm without the constraint, in
terms of mean-squared errors.


1      Introduction 

It has been recognized that feedforward neural networks may have difficulty in representing
certain sequential behavior of a target sequence [1]. This deficiency hampers the
application of feedforward networks in fields such as signal processing and dynamic
control, in which temporal structure plays an important role. Researchers are therefore
motivated to study the so-called recurrent networks, e.g., [1]-[8]. A recurrent network can
be obtained from a feedforward network by permitting additional, internal feedback channels,
and hence is capable of capturing more dynamic characteristics than a feedforward
network.

Owing  to  the  existence  of  internal  feedbacks  in  recurrent  networks,  learning  algorithms 
for  feedforward  networks  are  not  directly  applicable.  Kuan,  Hornik,  &  White  [9]  propose 
a  recurrent  back-propagation  (BP)  algorithm  and  establish  its  almost  sure  convergence 
property under the condition that feedback connections are suitably constrained. Because
the recurrent BP algorithm also performs a gradient search in the parameter space, it,
like other gradient-based learning algorithms, converges very slowly. However, it is well
known in the system identification literature that a Newton algorithm is computationally
and statistically more efficient than gradient-search algorithms. In this paper, we first
introduce  a  recurrent  Newton  algorithm  for  an  important  class  of  recurrent  networks  and 
then  sketch  its  almost  sure  convergence  property.  Similar  to  the  recurrent  BP  algorithm, 
the  recurrent  variables  must  be  constrained  suitably  in  the  recurrent  Newton  algorithm  to 
ensure  meaningful  convergence.  Our  simulation  results  strongly  indicate  that  the  recurrent 
Newton  algorithm  with  the  suggested  constraint  yields  better  convergence  results  than  the 
recurrent  BP  algorithm  and  the  Newton  algorithm  without  the  constraint. 

This paper is organized as follows. We first introduce a class of recurrent networks
in Section 2. The recurrent Newton algorithm and the constraint needed for recurrent
variables are discussed in Section 3. Simulation results are reported in Section 4.
Section 5 concludes the paper. An example of the Newton algorithm is given in the
Appendix.


2      Recurrent  Network 

Let $O$, $H$, $X$, and $R$ denote column vectors of $m$ network outputs, $q$ hidden-unit
activations, $n$ network inputs, and $p$ internal feedbacks, respectively. The elements of these
vectors are denoted by the corresponding lower-case letters. At time $t$, a recurrent network with
a single hidden layer and delayed internal feedbacks can be represented in the following
generic form:

$$O_t = \Phi(b + \beta' H_t),$$

$$H_t = \Psi(c + \gamma' X_t + \delta' R_t), \qquad (1)$$

where $\Phi$ and $\Psi$ are vector-valued functions, and

$$R_t = A(X_{t-1}, R_{t-1}; W), \qquad (2)$$

with $A$ also a vector-valued function and $W$ the $k$-dimensional vector of network connection
weights $b$, $\beta$, $c$, $\gamma$, and $\delta$. If $R_t$ is chosen to be $O_{t-1}$, (1) is a network with output feedbacks,
Jordan [1]; if $R_t$ is chosen to be $H_{t-1}$, it is a network with hidden-unit activation feedbacks,
Elman [8]. When $R_t = 0$, this network simply reduces to a feedforward network. The
fully-recurrent network of [7] can also be defined in a similar way with suitably defined
$R_t$ entering all the units. In this paper, however, we will confine ourselves to the class of
recurrent networks described in (1) and (2).
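As a minimal sketch of (1) and (2), the following NumPy code runs a few time steps of such a network. It assumes an identity output function and a logistic hidden function (the choice used later in the simulations); the function name `recurrent_step`, the `feedback` switch, and all dimensions are illustrative and not part of the original text.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def recurrent_step(x_t, r_t, b, beta, c, gamma, delta, feedback="elman"):
    """One time step of (1)-(2); returns (o_t, h_t, r_next)."""
    h_t = logistic(c + gamma.T @ x_t + delta.T @ r_t)  # H_t = Psi(c + gamma'X_t + delta'R_t)
    o_t = b + beta.T @ h_t                             # O_t = Phi(b + beta'H_t), Phi = identity
    r_next = h_t if feedback == "elman" else o_t       # R_{t+1} = H_t (Elman) or O_t (Jordan)
    return o_t, h_t, r_next

rng = np.random.default_rng(0)
n, q, m = 2, 3, 1                                      # inputs, hidden units, outputs
b, beta = np.zeros(m), rng.normal(size=(q, m))
c, gamma = np.zeros(q), rng.normal(size=(n, q))
delta = 0.1 * rng.normal(size=(q, q))                  # p = q in the Elman case
r = np.full(q, 0.5)                                    # initial recurrent state (assumed)
outputs = []
for x in rng.normal(size=(5, n)):
    o, h, r = recurrent_step(x, r, b, beta, c, gamma, delta)
    outputs.append(o)
```

Setting `delta` to zero removes the dependence on `r_t`, which is the feedforward special case mentioned above.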

Let $Y_t$ denote the vector of $m$ target variables. In a dynamic environment with time
series data, important temporal structures are embedded in lagged targets $Y_{t-1}$, $Y_{t-2}$, etc.
Thus, it is quite typical to use lagged targets as inputs for feedforward networks to capture
dynamics. Clearly, networks with too few lagged targets will not be able to capture certain
temporal structures that depend on a long history of targets. On the other hand, storing
all the past information in memory is impractical. This situation is similar
to building a linear AR (autoregressive) model in which a suitable number of AR lags
is typically difficult to determine. This difficulty can be circumvented if a network has a
"memory" device to store past information compactly. Recurrent networks by construction
have this property. To see this, note that by recursive substitution,

$$\begin{aligned}
R_t &= A(X_{t-1}, R_{t-1}; W) \\
    &= A(X_{t-1}, A(X_{t-2}, R_{t-2}; W); W) \\
    &\;\;\vdots \\
    &= \mathcal{L}(X^{t-1}; W), \qquad (3)
\end{aligned}$$

where $X^{t-1} = \{X_0, X_1, \ldots, X_{t-1}\}$ is the collection of all previous inputs. As $R_t$ depends
on the entire history of inputs and all the connection weights in a complex manner, introducing
recurrent variables to a feedforward network is somewhat similar to adding "moving
average" terms to AR models. (In what follows, we also write $R_t$ as $R_t(W)$ to signify its
parameter dependence.) Thus, recurrent variables serve to summarize past information
in a compact form, in terms of network outputs or hidden-unit activations. A recurrent
network may therefore be interpreted as a parsimonious network model which incorporates
all the past network inputs without storing all of them in memory. It is this property that
makes recurrent networks attractive in dynamic applications.

From (1) and (2) we can write the output of a recurrent network as

$$O_t = \Phi(b + \beta' \Psi(c + \gamma' X_t + \delta' R_t(W))) =: F(X_t, R_t(W); W),$$

or by (3), $O_t =: f(X^t; W)$ is also a function of the entire history of network inputs.
Given the mean-squared error (MSE) objective function, the parameters of interest are
$W^*$ which minimize

$$\lim_{t\to\infty} E\,(Y_t - f(X^t; W))^2
  = \lim_{t\to\infty} E\,(Y_t - E(Y_t \mid X^t))^2
  + \lim_{t\to\infty} E\,(E(Y_t \mid X^t) - f(X^t; W))^2; \qquad (4)$$

here, the limit is taken to permit system feedbacks. Observe that the first term on the
right-hand side of (4) is an intrinsic error which does not depend on $W$. Hence, $W^*$
minimizes the limit of the approximation error $E\,(E(Y_t \mid X^t) - f(X^t; W))^2$ in (4). It is
well known that $E(Y_t \mid X^t)$ is the best $L_2$-predictor of $Y_t$ given the $\sigma$-algebra generated by
$X^t$, denoted as $\sigma(X^t)$. As $E(Y_t \mid X_t)$ is measurable with respect to $\sigma(X^t)$,

$$E\,(Y_t - E(Y_t \mid X^t))^2 \le E\,(Y_t - E(Y_t \mid X_t))^2.$$

Therefore, a recurrent network may characterize the behavior of $Y_t$ better by approximating
the conditional mean function $E(Y_t \mid X^t)$, whereas a feedforward network with input
$X_t$ can only approximate $E(Y_t \mid X_t)$.
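The decomposition in (4) rests on a standard step that is not spelled out in the text. A sketch of it for a scalar target ($m = 1$) and fixed $t$, writing $\mu_t := E(Y_t \mid X^t)$ and $f_t := f(X^t; W)$ (both shorthands are introduced here only for this derivation):

```latex
\begin{aligned}
E\,(Y_t - f_t)^2
  &= E\,\bigl((Y_t - \mu_t) + (\mu_t - f_t)\bigr)^2 \\
  &= E\,(Y_t - \mu_t)^2 + E\,(\mu_t - f_t)^2
     + 2\,E\bigl[(Y_t - \mu_t)(\mu_t - f_t)\bigr], \\
E\bigl[(Y_t - \mu_t)(\mu_t - f_t)\bigr]
  &= E\Bigl[(\mu_t - f_t)\,E\bigl(Y_t - \mu_t \mid X^t\bigr)\Bigr] = 0,
\end{aligned}
```

where the last line uses the law of iterated expectations together with the fact that $\mu_t - f_t$ is a function of $X^t$. Taking limits in $t$ then gives the two terms on the right-hand side of (4).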


3      A  Recurrent  Newton  Algorithm 

It should be clear that any learning algorithm (on-line or off-line) obtained from the
objective function (4) must take into account the fact that $R_t(W)$ depends on network
connection weights. Let $E_t$ be the network error, i.e., $E_t = Y_t - F(X_t, R_t(W); W)$, which
depends on $W$ both directly and indirectly through the presence of $R_t(W)$. The derivatives
of $E_t$ with respect to $W$ are, by the chain rule,

$$\nabla E_t = -\underbrace{F_W(X_t, R_t(W); W)}_{k \times m}
  - \underbrace{\nabla R_t(W)}_{k \times p}\,\underbrace{F_R(X_t, R_t(W); W)}_{p \times m},$$

where $F_W$ and $F_R$ are matrices of the first-order derivatives of $F$ with respect to $W$ and
$R$, respectively. Because the BP algorithm for feedforward networks contains only the
term $F_W$, it is clear that it does not follow a correct gradient direction when recurrent
variables $R_t(W)$ are present. Therefore, it is extremely important for an algorithm in
recurrent networks to incorporate the additional term, $\nabla R_t(W)\, F_R$, so as to maintain a
correct search direction. A learning algorithm that ignores this term need not converge
to an MSE minimizer. In light of (3), it is evident that computation of $\nabla R_t(W)$ requires
all the past inputs. Hence, such computation becomes practically formidable because all
the past inputs must be stored in memory and because computation will increase with $t$.
This problem can be circumvented by recursively approximating $\nabla R_t(W)$ via

$$\nabla R_t(W) = \underbrace{A_W(X_{t-1}, R_{t-1}(W); W)}_{k \times p}
  + \underbrace{\nabla R_{t-1}(W)}_{k \times p}\,\underbrace{A_R(X_{t-1}, R_{t-1}(W); W)}_{p \times p},$$

where $A_W$ and $A_R$ are matrices of the first-order derivatives of $A$ with respect to $W$ and
$R$, respectively. This observation motivates the recurrent BP algorithm studied in [9].
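The gradient recursion can be checked numerically on a toy case. The sketch below assumes a scalar Elman-type state $r_{t+1} = \mathrm{logistic}(c + g x_t + d r_t)$ with $W = (c, g, d)$; the function `run`, the starting value 0.5, and the test inputs are illustrative assumptions. With $W$ held fixed, the recursion should reproduce the derivative of the final state with respect to $W$, which is compared here with a central finite difference.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def run(w, xs, r0=0.5):
    """Iterate r_{t+1} = logistic(c + g*x_t + d*r_t) together with its gradient recursion."""
    c, g, d = w
    r, grad = r0, np.zeros(3)              # the initial state does not depend on W
    for x in xs:
        h = logistic(c + g * x + d * r)
        s = h * (1.0 - h)                  # derivative of the logistic function
        a_w = s * np.array([1.0, x, r])    # direct effect of W on the new state (A_W)
        a_r = s * d                        # effect through the lagged state (A_R)
        grad = a_w + grad * a_r            # recursion for the gradient of R_t
        r = h
    return r, grad

rng = np.random.default_rng(0)
w = np.array([0.2, -0.7, 1.5])
xs = rng.normal(size=8)
r_T, grad_T = run(w, xs)

# Central finite differences of the final state with respect to each weight.
eps = 1e-6
numeric = np.array([
    (run(w + eps * e, xs)[0] - run(w - eps * e, xs)[0]) / (2 * eps)
    for e in np.eye(3)
])
```

The agreement between `grad_T` and `numeric` illustrates that only the previous gradient and the current input need to be kept in memory, not the whole input history.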

As a gradient descent algorithm, the BP algorithm for feedforward networks converges
very slowly and is statistically inefficient; see, e.g., [10]. The recurrent BP algorithm
therefore has the same disadvantage. In numerical optimization, better convergence results
can be obtained from the Newton method with a search direction based on the second-order
derivatives (the Hessian matrix). When the objective function is quadratic, the Newton
method converges to the minimum of the objective function in one iteration. To ensure that
the search direction always points "downhill", it is also typical to use a positive-definite
matrix, such as the outer product of the gradient vector, to approximate the Hessian
matrix. In view of this, a natural extension of the recurrent BP algorithm is an algorithm
analogous to the Newton method in numerical optimization. This type of algorithm is well
known in the system identification literature, e.g., Ljung & Söderström [11]. It is shown in [10]
that the stochastic Newton learning algorithm for feedforward networks is computationally
and statistically more efficient than the BP algorithm; in particular, it is asymptotically
equivalent to the nonlinear least squares estimator under very general conditions. In
what follows, a variable is written with the "hat" symbol if it is evaluated at parameter
estimates. Bearing the issue of taking gradients correctly in mind, a straightforward
Newton algorithm is as follows.

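$$\hat E_t = Y_t - F(X_t, \hat R_t; \hat W_t), \qquad (5)$$

$$\nabla \hat E_t = -F_W(X_t, \hat R_t; \hat W_t) - \hat D_t\, F_R(X_t, \hat R_t; \hat W_t), \qquad (6)$$

$$\hat W_{t+1} = \hat W_t - \eta_t\, \hat G_{t+1}^{-1} (\nabla \hat E_t\, \hat E_t), \qquad (7)$$

$$\hat G_{t+1} = \hat G_t + \eta_t (\nabla \hat E_t\, \nabla \hat E_t' - \hat G_t), \qquad (8)$$

$$\hat R_{t+1} = A(X_t, \hat R_t; \hat W_t), \qquad (9)$$

$$\hat D_{t+1} = A_W(X_t, \hat R_t; \hat W_t) + \hat D_t\, A_R(X_t, \hat R_t; \hat W_t), \qquad (10)$$

where $\{\eta_t\}$ is a sequence of learning rates of order $1/t$, and $\hat D_t$ is used in lieu of $\nabla R_t$.
Clearly, if $R$ is not present in the network, a recurrent network is just a feedforward
network, and the recurrent Newton algorithm simply reduces to the Newton algorithm for
feedforward networks [10]. In contrast with the recurrent BP algorithm, the recurrent
Newton algorithm contains an additional updating equation (8) which recursively updates
the outer product of $\nabla \hat E_t$ so as to provide an approximate Newton direction for (7).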
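To make the updating scheme concrete, here is a small sketch of one pass through (5)-(10) for the simplest possible case: a scalar input, one logistic hidden unit with Elman-type feedback, and an identity output. Several things are assumptions made for illustration only: the weight layout `W = [b, beta, c, gamma, delta]`, the use of the freshly updated matrix from (8) inside (7), the box bounds used for truncation, the starting values, and the toy AR(1) target series.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_step(W, G, r, D, x, y, eta, d_bound=3.995, w_bound=100.0):
    """One pass through (5)-(10) for a scalar-input network with one hidden unit.

    W = [b, beta, c, gamma, delta]; r is the recurrent variable (lagged hidden
    activation) and D stands in for the gradient of r with respect to W.
    """
    b, beta, c, gamma, delta = W
    h = logistic(c + gamma * x + delta * r)
    s = h * (1.0 - h)
    E = y - (b + beta * h)                                    # (5)
    F_W = np.array([1.0, h, beta * s, beta * s * x, beta * s * r])
    F_R = beta * s * delta
    grad_E = -F_W - D * F_R                                   # (6)
    G_new = G + eta * (np.outer(grad_E, grad_E) - G)          # (8)
    W_new = W - eta * np.linalg.solve(G_new, grad_E * E)      # (7)
    W_new = np.clip(W_new, -w_bound, w_bound)                 # assumed loose truncation box
    W_new[4] = np.clip(W_new[4], -d_bound, d_bound)           # tight bound on recurrent weight
    A_W = np.array([0.0, 0.0, s, s * x, s * r])
    D_new = A_W + D * (s * delta)                             # (10)
    return W_new, G_new, h, D_new, E                          # h plays the role of (9)

rng = np.random.default_rng(0)
T = 500
y = np.zeros(T + 1)
for t in range(1, T + 1):                                     # toy AR(1) target, illustration only
    y[t] = 0.5 * y[t - 1] + rng.normal()

W, G, r, D = rng.normal(size=5), np.eye(5), 0.5, np.zeros(5)
errors = []
for t in range(1, T + 1):
    W, G, r, D, E = newton_step(W, G, r, D, x=y[t - 1], y=y[t], eta=1.0 / (t + 1))
    errors.append(E)
```

This version solves a linear system at every step, which is the cost that motivates the inversion-free modification discussed next.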

There are some basic difficulties associated with the proposed algorithm. First, (7)
involves matrix inversion, which is not desirable in a recursive algorithm. Second, $\hat G_t$
must be a positive definite matrix to assure a correct search direction. In practice, some
modifications are needed to avoid these difficulties. Let $P_{t+1} = \eta_t \hat G_{t+1}^{-1}$ and
$\nu_t = (1 - \eta_t)\eta_{t-1}/\eta_t$. Then by a matrix inversion formula,

$$P_{t+1} = \frac{1}{\nu_t}\Bigl(P_t - P_t \nabla \hat E_t \bigl(\nabla \hat E_t' P_t \nabla \hat E_t + \nu_t I\bigr)^{-1} \nabla \hat E_t' P_t\Bigr).$$

For the network with a single output (i.e., $m = 1$) so that $\nabla \hat E_t$ is $k \times 1$,

$$P_{t+1} = \frac{1}{\nu_t}\left(P_t - \frac{P_t \nabla \hat E_t \nabla \hat E_t' P_t}{\nabla \hat E_t' P_t \nabla \hat E_t + \nu_t}\right), \qquad (11)$$

which does not involve matrix inversion. For the network with multiple outputs (i.e.,
$m > 1$), we follow the approach of Bierman [12] and compute $P_t$ using a sequence of
single-output algorithms in which no matrix inversion is needed. This leads to a modified
algorithm which contains (5), (6), (9), and (10) in the original version but substitutes the
updating equations below for (7) and (8):

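$$\hat W_{t+1} = \hat W_t - P_{t+1} (\nabla \hat E_t\, \hat E_t), \qquad (12)$$

$$P_{t+1}^{(0)} = P_t, \qquad (13)$$

$$P_{t+1}^{(j)} = P_{t+1}^{(j-1)} - \frac{P_{t+1}^{(j-1)} \nabla \hat E_{j,t} \nabla \hat E_{j,t}' P_{t+1}^{(j-1)}}{\nabla \hat E_{j,t}' P_{t+1}^{(j-1)} \nabla \hat E_{j,t} + \nu_t}, \quad j = 1, \ldots, m, \qquad (14)$$

$$\tilde P_{t+1} = P_{t+1}^{(m)} / \nu_t, \qquad (15)$$

$$P_{t+1} = \begin{cases} \tilde P_{t+1}, & \text{if } \tilde P_{t+1} - \epsilon I_k \text{ is p.s.d.}, \\ \tilde P_{t+1} + M_{t+1}, & \text{otherwise}, \end{cases} \qquad (16)$$

where $\nabla \hat E_{j,t}$ is the $j$-th column of $\nabla \hat E_t$, "p.s.d." stands for positive semidefinite, $\epsilon$ is a
small positive constant, and $M_{t+1}$ is chosen to make $P_{t+1} - \epsilon I_k$ a p.s.d. matrix. Note
that in (13)-(15), $P_{t+1}$ is updated as each output unit is added sequentially, and that
(16) implements a correction ensuring that $P_t$ is a p.s.d. matrix. This modified version is
analogous to the Newton algorithm considered in the system identification literature; for more
details and related numerical issues of the Newton algorithm we refer to [11, Chap. 6]. A
simple example of this modified algorithm is given in the Appendix.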
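The inversion-free updates can be compared against direct inversion on random inputs. The sketch below applies the rank-one step once per output column, rescales by $\nu_t$, and then applies a p.s.d. correction. The eigenvalue-flooring choice of $M_{t+1}$, the function names, and the numerical values of the learning rates are assumptions; the text only requires that $M_{t+1}$ restore positive semidefiniteness.

```python
import numpy as np

def rank_one_update(P, g, nu):
    """One step of (14): fold a single gradient column g into P without inverting."""
    Pg = P @ g
    return P - np.outer(Pg, Pg) / (g @ Pg + nu)

def update_P(P, grad_E, nu, eps=1e-8):
    """Equations (13)-(16) for a k x m gradient matrix grad_E."""
    P_j = P.copy()                                   # (13)
    for j in range(grad_E.shape[1]):                 # (14), one output unit at a time
        P_j = rank_one_update(P_j, grad_E[:, j], nu)
    P_new = P_j / nu                                 # (15)
    P_new = 0.5 * (P_new + P_new.T)                  # remove round-off asymmetry
    vals, vecs = np.linalg.eigh(P_new)
    if vals.min() < eps:                             # (16), with an assumed choice of M
        P_new = (vecs * np.maximum(vals, eps)) @ vecs.T
    return P_new

rng = np.random.default_rng(1)
k, m = 5, 3
eta_prev, eta = 0.2, 0.1                             # illustrative learning rates
nu = (1.0 - eta) * eta_prev / eta
B = rng.normal(size=(k, k))
G = B @ B.T + np.eye(k)                              # a positive-definite G_t
grad_E = rng.normal(size=(k, m))                     # stands in for the k x m gradient

P = eta_prev * np.linalg.inv(G)                      # P_t = eta_{t-1} G_t^{-1}
G_next = G + eta * (grad_E @ grad_E.T - G)           # equation (8)
P_direct = eta * np.linalg.inv(G_next)               # P_{t+1} = eta_t G_{t+1}^{-1}
P_fast = update_P(P, grad_E, nu)
```

With $m = 1$ the loop runs once and the result coincides with (11).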

Let $\theta := [W'\ (\mathrm{vec}\,P)']'$ be the $s$-dimensional column vector of parameters, where the
vec operator stacks all columns of a matrix into a column vector, and let $\hat\theta_t$ be its estimate.
To prevent $\hat\theta_t$ from diverging to infinity, it is also standard to impose a projection device
$\pi$ on these estimates. Given a compact parameter space $\Theta$, if $\hat\theta_t \in \Theta$, $\pi(\hat\theta_t) = \hat\theta_t$; if $\hat\theta_t \notin \Theta$,
$\pi(\hat\theta_t)$ takes a value in $\Theta$. The recurrent Newton algorithm proposed in this paper is the
modified algorithm discussed above together with a truncation device on $\hat\theta_t$. In practice,
we choose large truncation bounds for $\hat b_t$, $\hat\beta_t$, $\hat c_t$, $\hat\gamma_t$, and $P_t$ so that the behavior of these
estimates is virtually not affected; to ensure proper convergence behavior, however, some
restrictive bounds on the estimates of recurrent connection weights, $\hat\delta_t$, are needed. These
bounds are discussed below.
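One simple way to realize such a truncation device is coordinate-wise clipping to a box. This is only one admissible choice, since the text requires nothing more than that the projected value lie in the compact set; the parameter layout and the bound values below are hypothetical.

```python
import numpy as np

def project(theta, lower, upper):
    """Truncation device: identity inside the box, coordinate-wise clipping outside."""
    return np.clip(theta, lower, upper)

q = 4
# Hypothetical layout: two feedforward weights followed by two recurrent weights.
theta = np.array([250.0, -3.0, 7.5, -0.2])
lower = np.array([-1e3, -1e3, -4.0 / q, -4.0 / q])   # loose bounds, then tight bounds
upper = -lower
projected = project(theta, lower, upper)
```

Here the feedforward entries pass through untouched, while the first recurrent entry is pulled back to the edge of its tighter interval.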

The almost sure convergence property of $\hat\theta_t$ can be proved by combining the results
in Kuan & White [10, 13], which are based on the compactness approach of the ODE
(ordinary differential equation) method due to [14]; see also [9]. To reduce technicality,
we do not provide a formal theorem but only sketch this convergence property. Write

$$Z_t(\theta) = [\,Y_t'\ \ X_t'\ \ R_t(\theta)'\ \ (\mathrm{vec}\,D_t(\theta))'\,]'$$

and $\hat Z_t = Z_t(\hat\theta_t)$. Then the updating equations (12)-(16) can be written compactly as

$$\hat\theta_{t+1} = \hat\theta_t + \eta_t\, Q(\hat Z_t, \hat\theta_t). \qquad (17)$$

Let

$$\bar Q(\theta) = \lim_{t\to\infty} E\,[\,Q(Z_t(\theta); \theta)\,];$$

note that it is just the first-order condition of (4). Under very general conditions on
the data $Y_t$ and $X_t$ and network functions $\Phi$ and $\Psi$, if the recurrent function $A$ is a
contraction mapping in $R$ (i.e., for each $x$ and $\theta$, $|A_R(x, \cdot\,; \theta)| \le \kappa_0 < 1$), it can be shown
that $\hat\theta_t$ eventually behaves like the solution path of the ODE $\dot\theta = \bar Q(\theta)$ and converges with
probability one to the set of all locally asymptotically stable equilibria in $\Theta$ for this ODE.
If this set contains only finitely many points, $\hat\theta_t$ converges to one locally asymptotically
stable equilibrium, hence a local minimum of (4).

Typically, the aforementioned convergence property holds when $\Phi$ and $\Psi$ are twice
continuously differentiable; most commonly used network functions, such as the
logistic or hyperbolic tangent functions, satisfy this property. On the other hand, the
contraction mapping condition on $A$ is crucial and is not satisfied automatically. In view
of (3), $R_t$ could "explode" if $A$ is not a contraction mapping, because the effects of $x_t$
would accumulate very rapidly. Even when $A$ is a bounded (or squashing) function, this
condition is still needed; otherwise, $R_t$ would approach the upper or lower bound
of $A$ within a short learning period. For example, given an Elman network so that $A = \Psi$, if $\Psi$
is the logistic function, then hidden-unit activations would be close to zero or one if $\Psi$ is
not a contraction mapping in lagged hidden-unit activations. This causes "exaggeration"
of the behavior of hidden units and invalidates the learning results. If $A$ is a contraction
mapping, the recurrent variables in fact implement an exponentially forgetting memory
of the data sequence and are essentially well behaved. (We note that the contraction
mapping requirement on $A$ is similar to the "invertibility" condition for time series models
with moving average terms.) Let $M_\Phi = \sup_e \Phi'(e)$ and $M_\Psi = \sup_u \Psi'(u)$. From [9], the
contraction mapping property for the Jordan network is satisfied when

$$|\beta|\,|\delta| < (M_\Phi M_\Psi)^{-1},$$

where $|\cdot|$ stands for the Euclidean norm, and for the Elman network, it is satisfied when

$$|\delta| < M_\Psi^{-1}.$$

That is, the connection weights must be suitably constrained during the learning process
so as to ensure proper convergence behavior. Note that in the Jordan network the
connection weights $\beta$ must be restricted; hence the representability of the Jordan network
is unavoidably affected by this constraint. In the Elman network, however, only recurrent
connections are subject to the constraint, so that the feedforward part of the network
is not affected. Thus, as far as the representation capability of a network is concerned,
the Elman network seems to be more desirable since fewer network connection weights are
restricted. It is straightforward to verify that some sufficient conditions ensuring the
contraction mapping property in the Elman network are that $|\delta_{ij}| < 4/q$ for all $i$ and $j$ if $\Psi$
is the logistic function and that $|\delta_{ij}| < 1/q$ for all $i$ and $j$ if $\Psi$ is the hyperbolic tangent
function. We stress that such restrictions are not only of theoretical interest but also of
practical importance, as shown in the simulation results below.
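The sufficient condition for the logistic Elman network can be illustrated numerically. The sketch below draws recurrent weights with a large spread, clips them to the interval $\pm 3.995/q$, and evaluates the spectral norm of the Jacobian of the hidden activations with respect to the lagged activations at random points. The helper `elman_jacobian`, the dimensions, and the sampling of evaluation points are assumptions for illustration.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def elman_jacobian(x, r, c, gamma, delta):
    """Jacobian of H_t = logistic(c + gamma'x + delta'r) with respect to r."""
    h = logistic(c + gamma.T @ x + delta.T @ r)
    return (h * (1.0 - h))[:, None] * delta.T   # row i is h_i(1 - h_i) * delta_i'

rng = np.random.default_rng(2)
n, q = 2, 5
bound = 3.995 / q
c, gamma = rng.normal(size=q), rng.normal(size=(n, q))
delta = np.clip(10.0 * rng.normal(size=(q, q)), -bound, bound)  # constrained recurrent weights

norms = [
    np.linalg.norm(elman_jacobian(rng.normal(size=n), rng.uniform(size=q), c, gamma, delta), 2)
    for _ in range(200)
]
max_norm = max(norms)
```

Because the logistic derivative is at most 1/4 and every entry of `delta` is below $4/q$ in absolute value, the Frobenius norm of the Jacobian, and hence its spectral norm, stays below one.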

4      Simulations 

To evaluate the performance of the proposed Newton algorithm, we conduct the following
simulations. The target variables $y_t$, $t = 1, \ldots, T$, are generated from three models: (1) a
bilinear model:

$$y_t = 0.4\,y_{t-1} - 0.3\,y_{t-2} + 0.5\,y_{t-1}\epsilon_{t-1} + \epsilon_t,$$

where $\epsilon_t$ are independent $N(0, 1)$; (2) a Henon map:

$$x_t = 0.3\,y_{t-1},$$

$$y_t = 1 + x_{t-1} - 1.4\,y_{t-1}^2,$$

where $y_0 = -1 \times u$ and $y_{-1} = 0.5 \times u$, and $u$ is a uniform random variable on $[0, 1]$; (3) a
SETAR (self-exciting threshold autoregressive) model:

$$y_t = \begin{cases} 0.9\,y_{t-1} + \epsilon_t, & |y_{t-1}| \le 1, \\ -0.3\,y_{t-1} + \epsilon_t, & |y_{t-1}| > 1, \end{cases}$$

where $\epsilon_t$ are independent $N(0, 1)$. In the first two models $y_t$ depends on its own past values;
in the SETAR model $y_t$ depends only on $y_{t-1}$. We include the third model to see how
the algorithms perform when a recurrent network is not really needed. In the simulations,
the sample size $T$ is 1000, and the number of replications is 200. The network inputs are
the lagged target variables $y_{t-1}$ and $y_{t-2}$. The network activation function $\Phi$ is the identity
function and $\Psi$ is the logistic function. We estimate the Elman network with 4-6 hidden
units using four algorithms: the BP and Newton algorithms, each with and without the
constraint $|\delta_{ij}| < 3.995/q$, where $q$ is the number of hidden units. The initial feedforward
connection weights ($\beta$'s and $\gamma$'s) are generated from $N(0, 1)$ and the recurrent connection
weights ($\delta$'s) are generated from $10 \times N(0, 1)$. This allows us to assess the effectiveness of
the proposed constraint more easily.
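For readers who wish to reproduce the experimental setup, the three data-generating processes can be sketched as follows. The zero start-up values for the bilinear and SETAR series, the function names, and the seed are assumptions; the leading bilinear coefficient is taken to be 0.4, and the Henon recursion substitutes $x_{t-1} = 0.3\,y_{t-2}$ directly.

```python
import numpy as np

def bilinear(T, rng):
    """y_t = 0.4 y_{t-1} - 0.3 y_{t-2} + 0.5 y_{t-1} e_{t-1} + e_t, started from zeros."""
    e = rng.normal(size=T + 2)
    y = np.zeros(T + 2)
    for t in range(2, T + 2):
        y[t] = 0.4 * y[t - 1] - 0.3 * y[t - 2] + 0.5 * y[t - 1] * e[t - 1] + e[t]
    return y[2:]

def henon(T, rng):
    """y_t = 1 + 0.3 y_{t-2} - 1.4 y_{t-1}^2 with y_0 = -u, y_{-1} = 0.5 u."""
    u = rng.uniform()
    y = np.zeros(T + 2)
    y[0], y[1] = 0.5 * u, -1.0 * u          # y_{-1} and y_0
    for t in range(2, T + 2):
        y[t] = 1.0 + 0.3 * y[t - 2] - 1.4 * y[t - 1] ** 2
    return y[2:]

def setar(T, rng):
    """Two-regime threshold AR(1) with threshold |y_{t-1}| = 1, started from zero."""
    e = rng.normal(size=T + 1)
    y = np.zeros(T + 1)
    for t in range(1, T + 1):
        coef = 0.9 if abs(y[t - 1]) <= 1.0 else -0.3
        y[t] = coef * y[t - 1] + e[t]
    return y[1:]

rng = np.random.default_rng(0)
series = {"bilinear": bilinear(1000, rng), "henon": henon(1000, rng), "setar": setar(1000, rng)}
```

Each series has length $T = 1000$, matching the sample size used in the simulations; repeating the draw 200 times with different seeds would mimic the replication design.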

In the simulations, the MSE at each recursive step is recorded and averaged over the 200
replications. The averages of the MSEs from the last 500 recursive steps and the MSEs in
the final (1000-th) recursive step are summarized in Table 1. We observe from this table
that:

1. For all cases considered, the Newton algorithm with the suggested constraint yields
the lowest average and last MSE, and the BP algorithm without the constraint yields
the highest average and last MSE. The Newton algorithm without the constraint
performs even better than the recurrent BP algorithm with the constraint.

2. The average and last MSE of the (Newton and BP) algorithms without the constraint
may increase with the number of hidden units. Except for the SETAR model,
the average and last MSE of the algorithms with the constraint decrease with the
number of hidden units.

The first result shows that the Newton algorithm with the suggested constraint performs
uniformly better than the other three algorithms in terms of MSE. It is also interesting
to note from the second result that adding more hidden units need not result in lower MSE
if a learning algorithm is used without the constraint. In the SETAR model, a recurrent
network is not really needed; hence the Newton algorithm with the constraint results in
similar final MSE regardless of the number of hidden units. However, the MSE of the BP
algorithm with the constraint actually increases with the number of hidden units. We also
observe from the simulation results that, after some recursive steps, the MSEs of the four
algorithms have the following relationship:

Newton  with  the  constraint 

<  Newton  without  the  constraint 

<  BP  with  the  constraint 

<  BP  without  the  constraint. 

To conserve space, we only plot the MSEs for the networks with 6 hidden units in
Figures 1-3; the MSEs of the BP algorithm without the constraint are not included in
the figures because they are too large relative to the MSEs from the other algorithms. From
these figures, it can be seen that the Newton algorithm without the constraint behaves
unstably and may produce very large errors during the learning period. It can also be
seen that both the Newton and BP algorithms with the constraint are well behaved, but
the Newton algorithm results in much lower MSE and converges much more quickly than
the BP algorithm. These results clearly show the superiority of the proposed algorithm.

5      Conclusions 

In this paper we propose a recurrent Newton algorithm which extends the recurrent BP
algorithm introduced earlier to allow for a Newton search in the parameter space. To ensure
proper convergence behavior, a constraint must be imposed to prevent recurrent variables
from "exploding". The simulation results demonstrate that the proposed algorithm with
the constraint performs uniformly better than the other algorithms in terms of MSE.

Appendix

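An Example of the Recurrent Newton Algorithm: For notational simplicity, we
consider a single-output Elman network with $\Phi$ the identity function and $\Psi$ the logistic
function:

$$O_t = b + \beta' H_t = b + \sum_{i=1}^{q} \beta_i h_{it},$$

$$h_{it} = \frac{1}{1 + \exp(-c_i - \gamma_i' X_t - \delta_i' H_{t-1})}.$$

In this case, $R = H$, $A = \Psi$, and

$$W = [\,b\ \ \beta'\ \ c_1\ \ \gamma_1'\ \cdots\ c_q\ \ \gamma_q'\ \ \delta_1'\ \cdots\ \delta_q'\,]'.$$

The updating equations (12)-(16) can be easily computed from the following information.
Let $\tilde X_t = [1\ \ X_t']'$, $\tilde\beta = [b\ \ \beta']'$, and $\tilde\gamma_i = [c_i\ \ \gamma_i']'$. The network error in (5) is computed as

$$\hat E_t = Y_t - \Bigl(\hat b_t + \sum_{i=1}^{q} \hat\beta_{it} \hat h_{it}\Bigr),$$

$$\hat h_{it} = \frac{1}{1 + \exp(-\hat c_{it} - \hat\gamma_{it}' X_t - \hat\delta_{it}' \hat H_{t-1})}.$$

In (6), the vector $F_W$ contains the following sub-vectors:

$$F_{\tilde\beta} = [\,1\ \ \hat H_t'\,]',$$
$$F_{\tilde\gamma_i} = \hat\beta_{it}\, \hat h_{it}(1 - \hat h_{it})\, \tilde X_t,$$
$$F_{\delta_i} = \hat\beta_{it}\, \hat h_{it}(1 - \hat h_{it})\, \hat H_{t-1},$$

for $i = 1, \ldots, q$; and the vector $F_H$ is

$$F_H = \sum_{i=1}^{q} \hat\beta_{it}\, \hat h_{it}(1 - \hat h_{it})\, \hat\delta_{it}.$$

The recurrent variables of (9) are $\hat h_{it}$. In (10), the matrix $\Psi_W$ contains the following
submatrices:

$$\Psi_{\tilde\beta} = 0,$$
$$\Psi_{\tilde\gamma} = \mathrm{diag}[\,\hat h_{1t}(1 - \hat h_{1t})\tilde X_t\ \cdots\ \hat h_{qt}(1 - \hat h_{qt})\tilde X_t\,],$$
$$\Psi_{\delta} = \mathrm{diag}[\,\hat h_{1t}(1 - \hat h_{1t})\hat H_{t-1}\ \cdots\ \hat h_{qt}(1 - \hat h_{qt})\hat H_{t-1}\,],$$

and the matrix $\Psi_H$ is

$$\Psi_H = [\,\hat h_{1t}(1 - \hat h_{1t})\hat\delta_{1t}\ \cdots\ \hat h_{qt}(1 - \hat h_{qt})\hat\delta_{qt}\,].$$
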
The following values may be used to initialize the algorithm: the elements of $\hat W_1$ are
randomly generated from some random number generator, $G_1 = sI$ with $s = 100/(\sum_t \eta_t/T)$
and $I$ the identity matrix, $h_{i0} = 1/2$ for all $i$, and $D_1 = 0$.
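The derivative expressions in this example can be checked against finite differences. The sketch below holds the weights fixed, runs the network over a short input sequence while propagating the matrix of (10), and compares the resulting derivative of the final output, $F_W + D F_H$ (the negative of the gradient in (6)), with a numerical derivative. The flat weight layout, the helper names `unpack` and `output_and_gradient`, and the test dimensions are assumptions made for illustration.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def unpack(w, n, q):
    b, beta = w[0], w[1:1 + q]
    gam = w[1 + q:1 + q + q * (n + 1)].reshape(q, n + 1)   # row i = [c_i, gamma_i']
    dlt = w[1 + q + q * (n + 1):].reshape(q, q)            # row i = delta_i'
    return b, beta, gam, dlt

def output_and_gradient(w, xs, n, q):
    """Run the network over xs with W fixed; return O_T and dO_T/dW = F_W + D F_H."""
    b, beta, gam, dlt = unpack(w, n, q)
    k = w.size
    h_prev, D = np.full(q, 0.5), np.zeros((k, q))          # h_{i0} = 1/2 and D_1 = 0
    for x in xs:
        xt = np.concatenate(([1.0], x))                    # augmented input [1, X_t']'
        h = logistic(gam @ xt + dlt @ h_prev)
        s = h * (1.0 - h)
        F_W = np.concatenate(([1.0], h, np.outer(beta * s, xt).ravel(),
                              np.outer(beta * s, h_prev).ravel()))
        F_H = dlt.T @ (beta * s)
        grad_O = F_W + D @ F_H                             # uses D before it is updated
        Psi_W = np.zeros((k, q))
        for i in range(q):
            g0 = 1 + q + i * (n + 1)                       # start of the block for unit i
            d0 = 1 + q + q * (n + 1) + i * q
            Psi_W[g0:g0 + n + 1, i] = s[i] * xt
            Psi_W[d0:d0 + q, i] = s[i] * h_prev
        D = Psi_W + D @ (dlt.T * s)                        # equation (10)
        h_prev = h
    return b + beta @ h, grad_O

rng = np.random.default_rng(3)
n, q = 2, 3
k = 1 + q + q * (n + 1) + q * q
w = 0.5 * rng.normal(size=k)
xs = rng.normal(size=(6, n))
o_T, grad = output_and_gradient(w, xs, n, q)

eps = 1e-6
numeric = np.array([
    (output_and_gradient(w + eps * e, xs, n, q)[0]
     - output_and_gradient(w - eps * e, xs, n, q)[0]) / (2 * eps)
    for e in np.eye(k)
])
```

In the on-line algorithm the weights change at every step, so the propagated matrix is only an approximation to the true gradient; the check above isolates the algebra of the derivative formulas.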
Table 1. Summary of Simulation Results.

                   Newton with Const.   Newton w/o Const.    BP with Const.       BP w/o Const.
Model     Hidden   Average    Last      Average    Last      Average    Last      Average    Last
          Units    MSE        MSE       MSE        MSE       MSE        MSE       MSE        MSE

Bilinear  4        1.825      1.824     1.900      1.839     2.181      2.154     2.547      2.533
          5        1.820      1.811     1.902      1.852     2.139      2.111     2.579      2.564
          6        1.802      1.789     1.924      1.867     2.109      2.078     2.634      2.615

Henon Map 4        0.125      0.125     0.176      0.162     0.329      0.324     0.491      0.488
          5        0.101      0.098     0.159      0.142     0.324      0.319     0.527      0.524
          6        0.079      0.079     0.152      0.158     0.302      0.296     0.536      0.530

SETAR     4        1.014      1.011     1.038      1.032     1.059      1.054     1.115      1.114
          5        1.015      1.011     1.037      1.030     1.066      1.060     1.118      1.114
          6        1.017      1.012     1.042      1.033     1.072      1.066     1.191      1.182